As knowledge increases about human judgment processes, it is natural to
suppose  that  it  will  be  possible  to  use  this  knowledge  in  order  to  improve 
human  judgment  in  situations  where  biases  of  various  sorts  have  been  shown 
to occur. Despite the reasonableness of this expectation, judgmental
debiasing has proven extraordinarily difficult in most cases. This paper
suggests that the reason for this failure is that debiasing must be in terms of
the  procedures  that  are  actually  used  in  the  act  of  judging,  procedures 
about  which  very  little  is  known.  Two  experiments  are  presented  that 
illustrate how such procedural debiasing can be used to debias a Bayesian
inference task. In the first experiment, a training procedure is used that
corrects a common error in the direction of the adjustment process that
subjects use when integrating later evidence with earlier partial judgments.
In  the  second  procedure  a  focusing  technique  is  used  to  improve  the  relative 
weighting of samples in the overall judgment. Each of the procedures
accomplishes its particular end, and taken together the two procedures allow
naive  subjects  to  produce  judgments  that  are  essentially  Bayesian.  These 
results  are  discussed  in  terms  of  a  theoretical  model  of  the  judgment  process 
in  which  four  basic  stages  are  repeated  cyclically:  (a)  initial  scanning 
of  the  stimulus  information;  (b)  selection  of  items  for  processing  in  order 
of  importance;  (c)  extraction  of  scale  values  on  the  given  dimension  of 
judgment; and (d) adjustment of a composite value that summarises
already-processed components.

As increasingly more is known about human judgment processes, it becomes
reasonable to expect that this knowledge can be used to help people make better
judgments. This is particularly true in situations where failures of judgment
seem to be orderly manifestations of the processing mechanisms used by the judge
and not merely the random errors that might be attributed to inattention,
insufficient knowledge, faulty memory, and similar nonsystematic factors.
Unfortunately, however, it has been easier to imagine such improvement than to
produce it (Fischhoff, 1982).

Probably  the  first  attempts  at  debiasing  human  judgments  were  aimed  at 
reducing  the  tendency  of  naive  subjects  in  Bayesian  inference  tasks  to  produce 
judgments that are "conservative" relative to the Bayesian norm (Edwards, 1968).
In discussing these early debiasing efforts, it is useful to rely on a
classification scheme devised by Fischhoff (1982) in which debiasing procedures
are categorised according to whether they lay the blame for the bias at "the
doorstep of the judge, the task, or some mismatch between the two" (p. 424).

Allegations that a task is faulty generally center on the possible failure
of  the  experimenter  to  instill  in  subjects  sufficient  understanding  of  the  task 
and  sufficient  motivation  for  proper  performance.  In  the  case  of  Bayesian 
inference, Phillips and Edwards (1966) used specialised payoff schemes and
feedback in order both to encourage subjects to try harder and to help them better
understand  the  task.  Generally  speaking,  these  methods  had  some  effect  in  reducing 
conservatism,  but  they  were  not  able  to  eliminate  it. 

A  second  task  fault  that  was  investigated  involved  a  potential  bias  in  the 
response  scale.  The  argument  ran  that  "correct"  performance  in  Bayesian  tasks 
often  requires  the  production  of  extreme  responses,  particularly  when  the  judgments 
must  be  given  on  a  probability  scale.  If  subjects  are  hesitant  to  make  such 
extreme judgments, conservatism can result. Phillips and Edwards (1966) tested
this  hypothesis  by  comparing  judgments  on  probability  scales  with  judgments  on 
"odds"  and  "log  odds"  scales  which  require  less  extreme  responding.  They  found 
that  use  of  response  scales  such  as  these  reduced  conservatism  only  slightly 
relative  to  the  more  conventional  probability  scale. 


A more recent attempt at debiasing falls in Fischhoff's (1982) category
of attributing the bias to a mismatch between the task and the judge. Ells,
Seaver, and Edwards (1977) based their procedure on the observation that the
judgments of naive subjects in Bayesian tasks are often more like averages
(or estimates of population proportion) than like inferences (Beach, Wise, &
Barclay, 1970; Marks & Clarkson, 1972, 1973; Shanteau, 1970, 1972). This being
so, Ells et al. hypothesised that subjects might be better at judging the mean
log  likelihood  ratio  for  a  set  of  samples  than  at  judging  the  more  standard 
cumulative  log  likelihood  ratio.  They  also  noted  that  the  averaging  response 
would  reduce  problems  of  "response  bias"  if  there  were  any  operating. 

The  hypothesis  was  tested  by  using  two  groups  of  subjects,  one  of  which 
rated  their  average  certainty  for  the  target  hypotheses  and  the  other  of  which 
rated  their  cumulative  certainty.  Responses  from  both  groups  were  then  converted 
to  log  posterior  odds  form  and  regression  analysis  was  performed  for  each  subject 
comparing  inferred  log  posterior  odds  to  veridical  log  posterior  odds.  The 
results supported the hypothesis: log odds inferred from average certainty
judgments were definitely closer to veridical than odds inferred from cumulative
certainty judgments.

The present research represents an attempt at debiasing that falls in
Fischhoff's remaining category, that of attributing error to faulty judges.
Like  the  work  of  Ells  et  al.  (1977),  the  research  begins  with  the  observation 
that  untutored  subjects  in  Bayesian  tasks  tend  to  produce  data  that  are  more 
like  averages  than  like  inferences.  But  unlike  the  approach  of  Ells  et  al., 
no attempt is made to "engineer" the task to be better suited to human
proclivities. Instead, debiasing involves (a) analyzing the procedures that untutored
subjects  use  when  they  produce  averages,  (b)  warning  subjects  about  the  specific 
procedures  that  are  inappropriate,  and  (c)  providing  subjects  with  appropriate 
procedures  that  can  be  used  in  place  of  the  inappropriate  procedures. 

Averaging  and  Adjustment  in  Bayesian  Inference 

Bayesian inference tasks are usually instantiated in terms of the "bookbags
and poker chips" paradigm in which subjects consider two well-specified
hypotheses (i.e., bookbags) usually involving populations of binary events (i.e.,
red and blue poker chips). Typically the subject is shown two or more samples,
often sequentially, and is asked after each sample to rate the strength of his
or her belief about which population generated the samples.

According  to  Bayes'  theorem,  the  normative  response  for  such  situations 
is found by multiplying the prior odds ratio for the two hypotheses by the
likelihood ratio of the sample data given the two hypotheses. This yields the
posterior  odds  ratio: 


$$\frac{p(H_1 \mid D)}{p(H_2 \mid D)} = \frac{p(D \mid H_1)}{p(D \mid H_2)} \times \frac{p(H_1)}{p(H_2)} \tag{1}$$


Alternatively, one can write Bayes' theorem to give the probability of a
particular hypothesis:


$$p(H_1 \mid D) = \frac{p(D \mid H_1)\, p(H_1)}{p(D \mid H_1)\, p(H_1) + p(D \mid H_2)\, p(H_2)} \tag{2}$$


Note  that  in  these  equations,  the  relationship  between  current  data  and  previous 
data  is  multiplicative. 
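
The multiplicative relationship can be made concrete with a short computational
sketch. The Python below is an illustration only and is not part of the original
experiments; the function names and the example likelihood ratios (2 and 3) are
hypothetical. It applies Equation 1 sample by sample and compares the result
with the probability form of Equation 2.

```python
def posterior_odds(prior_odds, likelihood_ratios):
    """Equation 1 applied sequentially: each sample's likelihood ratio
    multiplies the odds carried over from the earlier samples."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds


def posterior_probability(prior_h1, lik_h1, lik_h2):
    """Equation 2 for two mutually exclusive, exhaustive hypotheses."""
    numerator = lik_h1 * prior_h1
    return numerator / (numerator + lik_h2 * (1.0 - prior_h1))


def odds_to_probability(odds):
    """Convert odds in favor of H1 into the probability of H1."""
    return odds / (1.0 + odds)


# Hypothetical illustration: equal priors and two samples whose likelihood
# ratios are 2 and 3. The second sample multiplies what the first left behind.
odds_after_two = posterior_odds(1.0, [2.0, 3.0])
prob_after_two = odds_to_probability(odds_after_two)

# One sample with p(D|H1) = .6 and p(D|H2) = .3 (likelihood ratio of 2):
# the odds form and the probability form give the same answer.
prob_from_eq2 = posterior_probability(0.5, 0.6, 0.3)
prob_from_eq1 = odds_to_probability(posterior_odds(1.0, [0.6 / 0.3]))
```

Because the update is a product, the order in which the samples arrive has no
effect on the final odds.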

How  do  naive  humans  perform  when  they  are  asked  to  provide  inferences  in 
Bayesian tasks? As has already been mentioned, human inferences differ from
Bayesian inferences in two important ways: (a) the individual judgments are
typically conservative relative to the Bayesian norm, and (b) the pattern of
judgments is suggestive more of averaging or estimation than of inference (Beach,
Wise, & Barclay, 1970; Marks & Clarkson, 1972, 1973; Shanteau, 1970, 1972).
Shanteau  (1970)  hypothesised  that  people's  judgments  in  such  tasks  could  be 
modeled by an algebraic rule in which the response, $R_n$, at any serial position,
$n$, is given by a weighted average of the scale values, $s_i$, of the previous and
current sample events:

$$R_n = \sum_{i=0}^{n} w_i s_i \tag{3}$$


In this equation the $w_i$ are weights that sum to unity and the term $w_0 s_0$
signifies the weight and scale value of a neutral initial impression. It should be noted
that  averaging  is  necessarily  conservative  relative  to  inference  because  averages 
always  lie  within  the  range  of  the  component  stimulus  values  whereas  inferences 
are often more extreme than any of their component values.
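
A minimal sketch of Equation 3 may help show why averaging is conservative.
The scale values and weights below are hypothetical, and treating the scale
values as log likelihood ratios (so that the Bayesian response is their sum)
is an assumption made only for illustration.

```python
def averaging_response(scale_values, weights):
    """Equation 3: a weighted average of scale values.

    Index 0 holds the neutral initial impression; the weights must sum to 1.
    """
    if len(scale_values) != len(weights):
        raise ValueError("need one weight per scale value")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to unity")
    return sum(w * s for w, s in zip(weights, scale_values))


# Hypothetical values: s_0 = 0 is neutral; both samples favor the same hypothesis.
scale_values = [0.0, 1.0, 2.0]
weights = [0.2, 0.4, 0.4]

averaged = averaging_response(scale_values, weights)

# Assumption: if the scale values are log likelihood ratios, Bayes' theorem
# adds them, so the inference can exceed every individual component.
inferred = sum(scale_values[1:])

print("averaged response:", averaged)
print("cumulative (Bayesian) response:", inferred)
```

Whatever weights are chosen, the averaged response cannot leave the range of the
scale values, whereas the cumulative response here lies beyond the strongest
single sample.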


Shanteau's model is successful in accounting for the quantitative features
of the data, but it does not suggest either why or how averaging occurs.
In previous research (Lopes, 1981; Lopes & Johnson, 1982; Lopes & Oden, 1980)
I have suggested that averaging may occur because subjects integrate the
stimulus information serially via an "anchoring and adjustment" process (Tversky
& Kahneman, 1974). In this process subjects are hypothesized to integrate "new"
information into "old" composite judgments by adjusting the old value as necessary
to  make  the  new  composite  lie  somewhere  between  the  old  composite  and  the  value 
of  the  new  information.  Although  this  process  is  qualitatively  equivalent  to 
averaging,  it  does  not  presuppose  that  subjects  ever  "compute"  an  average  in  any 
algebraic  —  or  even  any  conscious  —  sense  of  the  term.  Instead,  averaging  is 
simply  the  natural  consequence  of  the  adjustment  procedure. 

One  prediction  of  the  adjustment  model  is  that  subjects  in  the  Bayesian 
task  will  occasionally  make  adjustments  that  are  strictly  in  the  wrong  direction. 
Consider  two  samples,  both  of  which  support  the  same  hypothesis  but  to  different 
degrees. If a subject is first shown the weaker sample, we suppose that some
weak preliminary judgment will be made in favor of the supported hypothesis.
When the subject is later shown the stronger sample, adjustment will be made in
the direction of increased support for the hypothesis. This is entirely
appropriate qualitatively. But if the samples are reversed so that the weaker
sample follows the stronger, qualitatively inappropriate adjustment ought to
result. That is, the preliminary judgment ought to produce a relatively strong
result. When the weaker sample is later integrated into the judgment, adjustment
should be in the neutral direction since the value of the weaker sample is more
neutral than the preliminary judgment. Such adjustment is obviously inappropriate
since movement in the neutral direction is de facto movement toward the
alternative or non-supported hypothesis.
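
The prediction can be illustrated with a small sketch of the adjustment process.
The scale values, the neutral point of zero, and the adjustment weight of .5 are
all hypothetical; the code is meant only to show the direction of movement, not
to model any subject's data.

```python
NEUTRAL = 0.0  # hypothetical neutral point; positive values favor the supported hypothesis


def adjust_toward(old_composite, new_value, weight=0.5):
    """Move the old composite part of the way toward the new information."""
    return old_composite + weight * (new_value - old_composite)


def serial_judgment(first, second, weight=0.5):
    """Return (preliminary, final) judgments for two samples shown in order."""
    preliminary = adjust_toward(NEUTRAL, first, weight)
    final = adjust_toward(preliminary, second, weight)
    return preliminary, final


WEAK, STRONG = 1.0, 3.0  # hypothetical scale values, both favoring the same hypothesis

# Weaker sample first: the final judgment moves further from neutral.
weak_then_strong = serial_judgment(WEAK, STRONG)

# Stronger sample first: the final judgment falls back toward neutral, even
# though the second sample also supports the same hypothesis.
strong_then_weak = serial_judgment(STRONG, WEAK)

print("weak then strong:", weak_then_strong)
print("strong then weak:", strong_then_weak)
```

In both orders the second sample supports the same hypothesis as the first, so a
normatively correct adjustment would increase support in both cases.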

Previous  research  (Lopes,  1981)  has  clearly  supported  the  prediction  that 
subjects will adjust in the normatively incorrect direction when a weaker
sample  favoring  some  particular  hypothesis  follows  a  stronger  sample  favoring 
the  same  hypothesis.  The  present  research  is  aimed  at  finding  out  whether  these 
"directional  errors"  can  be  eliminated  by  training  that  warns  subjects  of  the 
occurrence  of  the  errors  and  also  teaches  subjects  an  alternative  procedure  that 
is  directionally  correct. 

Two  experiments  are  presented.  The  first  experiment  focuses  on  improving 
subjects' adjustment procedures qualitatively in specific cases where adjustment
errors are known to occur. The second experiment extends the training to
include instruction in a selectional procedure that is hypothesized to improve
subjects' quantitative performance.

Experiment  1 

Method 

Experimental task. Subjects in both conditions were asked to put
themselves in the place of a machinist whose job is to make decisions concerning the
maintenance  of  milling  machines  using  samples  of  parts  produced  by  the  machines. 
The  judgment  concerns  whether  or  not  a  critical  spring  has  broken  inside  the 
machine.  Subjects  were  told  that  normal  machines  have  a  rejection  rate  of  about 
12 parts per 1000 parts produced (H12/1000), whereas machines with broken springs
have a rejection rate of about 20 parts per 1000 (H20/1000). Thus, in abstract
terms, the subjects were required to decide between alternate Bernoulli processes,
one with p = .012 and the other with p = .02, with p being the probability of a
rejected part.
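
For reference, the diagnostic value of a sample under these two hypotheses can
be sketched as follows, assuming that each sample of 1000 parts is an
independent binomial draw. The function and constant names are illustrative and
do not come from the original study. Because the binomial coefficient is the
same under both hypotheses, it cancels from the likelihood ratio.

```python
import math

P_NORMAL = 0.012   # rejection rate for a normal machine
P_FAULTY = 0.020   # rejection rate for a machine with a broken spring
N_PARTS = 1000     # parts per sample


def log_likelihood_ratio(rejects, n=N_PARTS, p_faulty=P_FAULTY, p_normal=P_NORMAL):
    """Natural-log likelihood ratio (faulty versus normal) for one sample."""
    return (rejects * math.log(p_faulty / p_normal)
            + (n - rejects) * math.log((1.0 - p_faulty) / (1.0 - p_normal)))


def neutral_point(n=N_PARTS, p_faulty=P_FAULTY, p_normal=P_NORMAL):
    """Reject count at which a sample favors neither hypothesis."""
    per_reject = math.log(p_faulty / p_normal)
    per_accept = math.log((1.0 - p_faulty) / (1.0 - p_normal))
    return -n * per_accept / (per_reject - per_accept)


for rejects in range(12, 21):
    print(rejects, round(log_likelihood_ratio(rejects), 3))
print("neutral point:", round(neutral_point(), 2))
```

Under these assumptions the neutral point falls between 15 and 16 rejects, so
samples of 15 or fewer rejects favor the normal machine and samples of 16 or
more favor the faulty one, with 16 doing so only weakly.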

Stimulus  design.  The  stimulus  design  was  a  9  x  9,  first-sample  x  second- 
sample,  factorial  design  in  which  the  levels  of  both  factors  comprised  the  same 
samples of parts. These were 12, 13, 14, 15, 16, 17, 18, 19, and 20 rejects per
1000  parts,  respectively. 

Procedure.  Subjects  were  run  individually  in  sessions  that  took  about 
40 minutes for control subjects and 50 minutes for trained subjects. At the
beginning of the session subjects were brought into a soundproof booth and
seated in front of a computer-controlled video terminal. Control subjects were
then given general instructions about the nature of the task and shown how to
read  the  stimulus  display.  A  sample  of  a  stimulus  display  is  shown  in  Figure  1. 

At  the  top  of  the  display  is  a  box  showing  a  sample  with  13  rejects  out  of 
1000 parts. Under this is a notation showing that this is the first of two
samples.  At  the  bottom  of  the  display  is  a  response  scale  anchored  at  the  left 
by  the  words  "machine  normal"  and  at  the  right  by  the  words  "machine  faulty". 

Figure  1  about  here 

The  procedure  for  each  trial  was  identical.  Subjects  read  the  information 
for the first sample and then rated their degree of belief as to whether the
machine was milling normally or not. They made their ratings using a joystick
to move the rating arrow (shown in the middle of the scale in Figure 1) along
the response scale. When they finished their initial rating, subjects pushed
a button on the response box. This caused the initial rating to be transmitted
to the computer and also caused the first sample to be replaced by a second
sample of parts from the same machine. Subjects revised their initial rating
to account for the new sample and pushed the response button to transmit their
final response to the computer. Then they initiated the next trial by returning
the  response  arrow  to  the  middle  of  the  scale  and  pushing  the  button  again. 

The instructions for trained subjects were essentially identical to those
for control subjects through the explanation of the stimulus display and the
rating response, except that trained subjects were told at the outset that they
would be taught a procedure for avoiding a common judgment error. The actual
training took place during the early practice trials. The first practice trial
was a weak-strong pair (17/19) that was chosen especially to elicit correct
responses from all subjects. For this trial, all subjects initially rated a
sample of 17 rejects to favor the faulty machine moderately and then adjusted
this  rating  to  favor  the  faulty  machine  even  more  strongly  after  presentation 
of  the  sample  of  19  rejects. 

The second trial was a strong-weak pair (13/14) chosen to elicit the
directional error. On this trial subjects were shown the first sample (13 rejects)
and allowed to make their initial rating and to transmit their response. Then
they were shown the second sample (14 rejects) and were allowed to make their
adjustment, but they were stopped before they transmitted the response. Most of
the subjects (20 of 31) made their adjustment in the wrong direction and were
read the instructions reproduced below. The others were read similar instructions
but with wording changed to accommodate the fact that they had, in fact, responded
correctly on this trial.

Before  you  transmit  your  response,  let  me  talk  with  you  about 
your  response.  You  shouldn't  feel  bad,  but  remember  I  told 
you  that  many  people  make  an  error  in  this  task.  Well,  you 
just  made  it.  Let  me  explain  it  to  you.  Most  people,  if  they 
are  given  a  sample  of  14  rejected  parts  as  a  first  sample,  say 
that  the  machine  is  more  likely  to  be  functioning  normally  than 
not.  But  when  they  are  given  a  sample  of  14  rejected  parts 
after they have just been given a sample of 13 rejected parts,
they tend to adjust their judgment toward the right, that is,
toward  the  machine  being  broken.  Now  if  you  think  about  it, 
this  is  an  error  of  adjustment  since  a  sample  of  14  rejects 
favors  the  normal  machine  and  therefore  provides  additional 
evidence  that  the  machine  is  normal.  Thus,  the  adjustment 
should be toward the left, that is, toward the machine
functioning normally. Do you understand this so far?

After subjects indicated that they understood what the error was, the
experimenter taught them a simple procedure for avoiding the error. Basically, this
was  to  separate  each  judgment  operation  into  two  steps:  (a)  the  labeling  of  each 
sample  as  either  favoring  the  "normal"  hypothesis  or  the  "faulty"  hypothesis 
and  (b)  the  adjustment  of  the  current  response  in  the  direction  given  by  the 
label.  (It  is  convenient  to  think  of  the  initial  rating  produced  by  the  subject 
after  presentation  of  the  first  sample  as  involving  an  adjustment  made  to  an 
earlier  and  implicit  "neutral"  response  produced  by  the  subject  at  the  onset  of 
each  new  trial.)  Thus,  when  both  first  and  second  samples  favored  the  same 
hypothesis,  both  the  initial  rating  and  the  final  adjustment  would  be  made  in 
the  same  direction  relative  to  the  neutral  point,  and  only  when  the  second 
sample  favored  a  hypothesis  different  from  the  first  would  the  final  adjustment 
be  opposite  in  direction  to  the  initial  rating. 
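
The two-step procedure can be sketched in code as follows. This is only an
illustration of the logic taught to subjects: the 0-1 scale with .5 as the
neutral point, the fixed step size, and the cutoff of 16 rejects used for
labeling are all assumptions, since the training gave no quantitative guidance.

```python
NEUTRAL = 0.5  # assumed scale: 0 = "machine normal", 1 = "machine faulty"


def label(rejects, cutoff=16):
    """Step (a): decide which hypothesis the sample favors.

    The cutoff is hypothetical; samples exactly at the cutoff are left unlabeled.
    """
    if rejects < cutoff:
        return "normal"
    if rejects > cutoff:
        return "faulty"
    return "unclear"


def adjust(current, sample_label, step=0.1):
    """Step (b): move the current response in the direction given by the label."""
    if sample_label == "normal":
        return max(0.0, current - step)
    if sample_label == "faulty":
        return min(1.0, current + step)
    return current


def trained_judgment(first, second, step=0.1):
    """Return (initial, final) ratings, starting from the implicit neutral response."""
    initial = adjust(NEUTRAL, label(first), step)
    final = adjust(initial, label(second), step)
    return initial, final


print(trained_judgment(13, 14))  # both samples favor "normal": two moves to the left
print(trained_judgment(17, 17))  # identical samples: the second still adds support
print(trained_judgment(13, 19))  # conflicting samples: the final move reverses direction
```

The point of the separation is that the direction of each adjustment depends
only on the label of the current sample, never on how the sample compares with
the judgment already on the scale.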

After  teaching  subjects  the  judgment  procedure,  the  experimenter  asked  them 
to  respond  to  several  trials  on  their  own,  while  verbalizing  what  they  were 
doing. This allowed the experimenter to check that they were explicitly
separating the labeling and the adjustment steps and that they were adjusting at each
step  in  the  direction  given  by  the  labeling  operation.  Among  these  training 
trials were two for which the two samples were identical. When the first such
trial  (17/17)  appeared,  the  experimenter  waited  to  see  whether  the  subject  would 
adjust  for  the  second  sample  and  then  stopped  the  trial  for  further 
instruction.  Subjects  who  had  failed  to  adjust  (8  out  of  31)  were  told,  "Now 
this  kind  of  trial  also  causes  errors.  Let  me  explain.  Your  first  sample  was 
17  rejects  and  you  judged  the  machine  as  likely  to  be  broken.  Then  you  got 
new  evidence  of  17  rejects  also  favoring  the  machine  being  broken,  but  you  didn't 
adjust.  Actually  you  should  have  adjusted  since  that  is  additional  evidence 
in  favor  of  the  machine  being  broken.  Do  you  see  what  I  mean?"  Subjects  who 
had  adjusted  correctly  were  read  similar  instructions,  but  modified  to  accord 
with  their  correct  response. 


It is important to note that the training procedure involved only
qualitative features of the judgment process. At no time were subjects given
instruction concerning how they ought to evaluate the sample information
quantitatively.  Although  such  training  might  be  helpful  generally,  the  aim 
of  the  present  research  was  to  determine  the  degree  to  which  judgment  can  be 
improved  by  strictly  procedural  means,  that  is,  by  giving  subjects  better 
procedures  for  operating  on  information  rather  than  by  giving  them  better  or 
more  accurate  information. 

Altogether  there  were  13  trials  for  practice  and  training  for  the  trained 
subjects.  Control  subjects  received  the  same  13  trials  for  practice,  but  with 
no  training.  Then  both  groups  of  subjects  received  two  replications  of  the 
stimulus design, bringing the total number of trials to 175 per subject.
Experimental trials within each replication were ordered randomly but with the
restriction that no sample appear either as first-sample or second-sample on two
consecutive trials.
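
One way such a restricted random order could be generated is sketched below.
The report does not say how the orders were actually produced; the greedy
draw-with-restarts approach and all names here are assumptions made for
illustration.

```python
import random
from itertools import product

LEVELS = range(12, 21)      # 12 through 20 rejects per 1000 parts
PRACTICE_TRIALS = 13
REPLICATIONS = 2


def constrained_order(rng, max_restarts=1000):
    """Randomly order the 81 pairs so that no first-sample value and no
    second-sample value repeats on consecutive trials."""
    trials = list(product(LEVELS, LEVELS))
    for _ in range(max_restarts):
        remaining = trials[:]
        order = []
        while remaining:
            previous = order[-1] if order else None
            candidates = [t for t in remaining
                          if previous is None
                          or (t[0] != previous[0] and t[1] != previous[1])]
            if not candidates:
                break  # dead end near the end of the list; start over
            choice = rng.choice(candidates)
            remaining.remove(choice)
            order.append(choice)
        else:
            return order  # every pair was placed without a dead end
    raise RuntimeError("no valid order found")


order = constrained_order(random.Random(0))
total_trials = PRACTICE_TRIALS + REPLICATIONS * len(order)
print(len(order), total_trials)
```

The trial count follows directly from the design: 13 practice trials plus two
replications of the 81 pairs gives the 175 trials reported.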

Subjects. The subjects for control and trained groups were, respectively,
30 and 31 student volunteers from the University of Wisconsin-Madison.
Approximately half were males and half females. They served for credit to be applied
to  their  course  grades  in  introductory  psychology. 

Results  and  Discussion 

Two  questions  are  of  interest  in  this  experiment.  The  first  is  whether 
training  concerning  directional  adjustment  errors  can  prevent  or  at  least  reduce 
their  prevalence  in  the  inference  task.  The  second  is  whether,  given  that  such 
prevention  or  reduction  of  errors  is  possible,  this  leads  to  improvement  in  the 
accuracy  of  the  final  judgments. 

Data  bearing  on  the  first  question  are  given  in  Table  1.  Five  subjects 
have  been  dropped  from  the  control  condition  and  five  from  the  trained  condition 
since  these  subjects  appeared  to  base  their  final  judgments  entirely  on  the 
second  sample.  Note,  however,  that  the  basic  results  of  the  experiment  would 
have  been  the  same  whether  these  subjects  were  retained  or  not. 

Subjects  were  unanimous  in  treating  samples  of  12  to  15  rejects  per  1000 
parts  as  favoring  the  machine  being  normal  and  samples  of  17  to  20  rejects 
as  favoring  the  machine  being  broken,  but  they  were  highly  variable  in  how 
they  treated  samples  of  16  rejects.  (Actually,  such  samples  favor  slightly  the 
machine being broken.) Some subjects tended to treat these as neutral, others
treated them as favoring one or the other hypothesis, and still others treated
them inconsistently across trials. Because of this variability, pairs
involving 16 rejects are not considered explicitly in the formal analysis.
However, an interesting problem involving these samples that occurred for some
subjects is described in the General Discussion.

Table  1  about  here 

Taken  together,  there  were  20  pairs  in  which  adjustment  errors  might  have 
been  expected.  These  comprised  the  eight  pairs  along  the  diagonal  of  the 
stimulus design in which the two samples are identical (i.e., 12/12, 13/13,
14/14, 15/15, 17/17, 18/18, 19/19, 20/20) and the twelve non-diagonal pairs
in which a (stronger) sample favoring a particular hypothesis is followed by
a weaker sample favoring the same hypothesis (i.e., 12/13, 12/14, 12/15, 13/14,
13/15, 14/15, 20/19, 20/18, 20/17, 19/18, 19/17, 18/17). These pairs are
indicated in the table as "diagonal" pairs and "strong-weak" pairs, respectively.
The  table  also  gives  results  for  the  set  of  "weak-strong"  pairs.  These  are 
exactly  the  same  set  as  the  strong-weak  pairs  except  that  the  stronger  sample 
in  each  pair  is  preceded  by  the  weaker.  Since  for  these  pairs  the  intuitive 
direction of adjustment is normatively correct, they provide an estimate of the
rate of adjustment errors that occur for reasons other than the incompatibility
of the normative response with the intuitive direction of adjustment (e.g.,
misreading the stimulus).
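
The classification of pairs can be checked by enumerating the design. In the
sketch below, the strength of a sample is taken to be its distance from 16
rejects, which is an illustrative assumption that preserves the ordering of
samples within each side; the names are hypothetical.

```python
from collections import Counter
from itertools import product

NORMAL_SIDE = (12, 13, 14, 15)   # samples treated as favoring "machine normal"
FAULTY_SIDE = (17, 18, 19, 20)   # samples treated as favoring "machine faulty"
REPLICATIONS = 2


def strength(rejects):
    """Assumed ordering of evidence strength within a side."""
    return abs(rejects - 16)


def classify(first, second):
    """Label a first-sample/second-sample pair by the categories used in Table 1."""
    same_side = ((first in NORMAL_SIDE and second in NORMAL_SIDE)
                 or (first in FAULTY_SIDE and second in FAULTY_SIDE))
    if not same_side:
        return "other"  # pairs involving 16, or samples favoring different hypotheses
    if first == second:
        return "diagonal"
    return "strong-weak" if strength(first) > strength(second) else "weak-strong"


counts = Counter(classify(a, b) for a, b in product(range(12, 21), repeat=2))
max_errors = {kind: n * REPLICATIONS for kind, n in counts.items() if kind != "other"}
print(counts)
print(max_errors)
```

With two replications, the counts of 12 strong-weak pairs and 8 diagonal pairs
correspond to the maxima of 24 and 16 errors per subject used in the analysis.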

Looking  first  at  strong-weak  pairs  and  weak-strong  pairs,  it  is  clear  that 
the training procedure has been effective in reducing the number of directional
adjustment errors, where "error" refers to an explicit adjustment in the
nonnormative direction. (Including as errors occasions on which no adjustment
was made would have produced essentially the same results.) For the control
group there is an average of 13.4 errors per subject (out of 24 maximum) for
strong-weak pairs compared to an average of only .40 errors per subject for
weak-strong pairs, F(1,24) = 60.43, p < .05. For the trained subjects, however,
the average is 3.11 errors for strong-weak pairs compared to .42 errors for
weak-strong pairs, F(1,25) = 8.33, p < .05. Comparing across groups, the trained
subjects have significantly fewer errors than control subjects for strong-weak
pairs [F(1,49) = 113.46, p < .05] but not for weak-strong pairs [F < 1].

The  final  row  of  the  table  gives  the  results  for  diagonal  pairs.  Adjustment 
errors  have  been  scored  for  these  pairs  only  if  there  was  room  for  the  adjustment 
to  occur  (i.e.,  the  response  was  not  already  at  the  end  of  the  scale)  and  if  there 
was either no adjustment at all or adjustment in the wrong direction. (Errors
of the latter type were very rare.) Mean errors (out of a maximum of 16)
were 3.24 for the control group and 1.08 for the trained group. Both these
error rates are significantly different from zero [F(1,24) = 26.18 and F(1,25)
= 8.99, respectively, p < .05], and the rate for the control group is
significantly higher than that for the trained group [F(1,49) = 10.29, p < .05].

In  general,  it  appears  that  the  training  procedure  was  able  to  reduce 
(although not completely to eliminate) directional adjustment errors,
particularly for strong-weak pairs. The question remains, however, as to whether
this  reduction  was  accompanied  by  improved  accuracy  of  judgment  (i.e.,  reduced 
conservatism).  Figure  2  gives  the  final  judgment  data  for  the  control  group 
pooled over both subjects and replications. For purposes of comparison,
Figure 3 gives theoretical values for an optimal Bayesian judge. In Figure
2, the data for pairs where errors are likely (i.e., strong-weak pairs
and diagonal pairs) are shown by filled symbols and the data for remaining
pairs  are  shown  by  open  symbols.  The  row  parameter  in  both  cases  is  number 
of  rejects  in  the  second  sample. 


Figures  2  and  3  about  here 


It  is  clear  graphically  that  there  is  a  large  difference  between  the  data 
pattern  produced  by  the  control  subjects  and  the  theoretical  pattern:  The 
theoretical  data  have  a  "barrel"  shape  whereas  the  control  data  look  more  like 
a  set  of  parallel  lines.  This  appearance  is  borne  out  by  analysis  of  variance: 
Although the data for control subjects have a significant interaction
[F(64,1536) = 2.66, p < .05], it accounts for only .7% of the total systematic
sum of squares. By way of contrast, analysis of variance on the theoretical
values indicates that the interaction should account for 4.66% of the
systematic sum of squares.
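
The share of systematic variance carried by the interaction can be sketched for
a table of theoretical values. The code below assumes equal prior odds, treats
each sample as a binomial draw of 1000 parts, and expresses the Bayesian
response as a posterior probability on a 0-1 scale. These are assumptions about
how the theoretical values might be computed, so the result need not reproduce
the 4.66% figure exactly.

```python
import math

LEVELS = list(range(12, 21))


def log_lr(rejects, n=1000, p_faulty=0.020, p_normal=0.012):
    """Log likelihood ratio (faulty versus normal) for one sample."""
    return (rejects * math.log(p_faulty / p_normal)
            + (n - rejects) * math.log((1.0 - p_faulty) / (1.0 - p_normal)))


def bayes_probability_faulty(first, second):
    """Posterior probability of the faulty machine after two samples,
    assuming equal prior odds."""
    log_odds = log_lr(first) + log_lr(second)
    return 1.0 / (1.0 + math.exp(-log_odds))


def interaction_share(table):
    """Interaction sum of squares as a proportion of rows + columns + interaction."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(map(sum, table)) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(table[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    ss_rows = n_cols * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n_rows * sum((m - grand) ** 2 for m in col_means)
    ss_inter = sum((table[i][j] - row_means[i] - col_means[j] + grand) ** 2
                   for i in range(n_rows) for j in range(n_cols))
    return ss_inter / (ss_rows + ss_cols + ss_inter)


theoretical = [[bayes_probability_faulty(a, b) for b in LEVELS] for a in LEVELS]
share = interaction_share(theoretical)
print(f"interaction share: {100 * share:.2f}%")
```

A perfectly parallel (additive) table would give a share of zero, which is the
sense in which the nearly parallel control data depart from the theoretical
pattern.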

The  data  for  the  trained  subjects  are  in  Figure  4.  Overall,  the  figure 
presents the same appearance as that for the control group, although the
interaction term [F(64,1600) = 4.98, p < .05] is somewhat larger, accounting for
1.2%  of  the  systematic  sum  of  squares. 


Figure  4  about  here 


Although  Figure  4  gives  all  the  data,  the  points  that  are  critical  for 
the  training  procedure  are  just  those  that  are  filled.  Comparison  of  these 
critical pairs for control and trained subjects shows that subjects who had
received  training  were,  indeed,  more  accurate  in  their  judgments  for  these 
points. Figured on group means, the root-mean-squared deviations (RMSDs) between
obtained and theoretical values were, for the control group, .0723 for strong-weak
pairs and .0227 for diagonal pairs, relative to the 0-1 response scale.
For the trained subjects, however, these values were .0187 and .0047,
respectively.
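
For readers unfamiliar with the measure, the root-mean-squared deviation can be
sketched as below. The numbers in the example are hypothetical and are not data
from the experiment.

```python
import math


def rmsd(obtained, theoretical):
    """Root-mean-squared deviation between paired obtained and theoretical values."""
    if len(obtained) != len(theoretical) or not obtained:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(o - t) ** 2 for o, t in zip(obtained, theoretical)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical group means and theoretical values on a 0-1 response scale.
obtained_means = [0.30, 0.25, 0.20]
theoretical_values = [0.25, 0.20, 0.10]
example = rmsd(obtained_means, theoretical_values)
print(round(example, 4))
```

Smaller values indicate group means that lie closer to the theoretical values
across the set of pairs being compared.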

The  data  for  the  critical  pairs  are  about  what  might  be  expected  given 
the nature of the training, but an unexpected finding is that improved
performance on strong-weak pairs generalized to the corresponding weak-strong pairs:
Although the training procedure did not in any way attempt to modify subjects'
procedures for judging weak-strong pairs, trained subjects did about as well
on these (RMSD = .0220) as they did on the strong-weak pairs. In the same
way, the control subjects did about as poorly on weak-strong pairs (RMSD =
.0626) as they did on the strong-weak pairs.

This generalization of improved accuracy from strong-weak to weak-strong
pairs is of interest since it suggests that the training instructions may have
been  effective  not  only  in  helping  subjects  avoid  the  specific  adjustment  error, 
but  also  in  helping  them  understand  the  task  better.  Although  the  present  data 
do  not  speak  to  the  issue  directly,  previous  data  showing  that  the  judgments  of 
naive  subjects  are  more  like  estimates  of  population  proportion  than  they  are 
like  inferences  (Beach,  Wise,  &  Barclay,  1970;  Marks  &  Clarkson,  1972,  1973; 
Shanteau, 1970, 1972) suggest that subjects may have difficulty understanding
the difference between inference and estimation. By focusing attention on the
directional errors in inference that occur for strong-weak pairs, one may also,
by serendipity, focus attention on the special characteristics that distinguish
inference from estimation and, hence, improve subjects' understanding of the task.

But if trained subjects do understand the inference process better than
control subjects, why do their data show the same tendency toward parallelism?
Put another way, why are their inferences so conservative for those
heterogeneous pairs (shown in the upper left and lower right quadrants of the
figures) in which the two samples favor different hypotheses? The answer to
this may lie in the weights that subjects give to the various samples.

Consider a situation that is like the current one except that subjects
are actually instructed to estimate the proportion of rejected parts for a
particular machine. If the two samples are of equal size and equal reliability,
the subject ought to give them equal weight and simply average the values.
Furthermore, no matter what value a particular sample has (i.e., whether the
first sample is 12, 14, 16 or any other number of rejects), the value itself
should not affect the weight of the sample in the overall judgment. A subject
who followed such a "constant weighting" strategy would produce a parallel
pattern of data such as is found in Figures 2 and 4.
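
The link between constant weighting and parallelism can be checked with a small Python sketch. It assumes, purely for illustration, that the judgment is a fixed-weight average of the two reject counts; the function name and the equal weights are hypothetical choices, not the scoring used in the experiments.

```python
def constant_weight_average(first, second, weight=0.5):
    """Average two sample values with a weight that ignores the values themselves."""
    return weight * first + (1 - weight) * second

values = [12, 14, 16, 18, 20]
# Rows index the first sample, columns the second sample.
table = [[constant_weight_average(r, c) for c in values] for r in values]

# Vertical gap between the first two rows, computed column by column.
row_gaps = [table[1][j] - table[0][j] for j in range(len(values))]
```

The gap between any two rows is the same in every column, so the curves plotted from such a table never converge or diverge; that is the parallel pattern.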

The Bayesian task, however, requires that subjects adopt a "differential
weighting" strategy: Samples that are extreme (i.e., 12 or 20) are more
diagnostic than samples that are nearer neutral (i.e., 15 or 17), and should
be weighted more heavily in the inference process. But this is what,
apparently, subjects do not naturally do in the Bayesian task (e.g., Beach,
Wise, & Barclay, 1970; Shanteau, 1970) or in a great many other tasks as
well (cf. Anderson, 1974). Thus, it may be that subjects in the trained
condition do understand the Bayesian task better than their control condition
analogs, at least in the sense that they are really integrating evidence and
not merely integrating sample sizes, but they may not understand that the more
extreme estimates are more diagnostic and hence should be accorded greater
weight. For homogeneous pairs in which both samples favor the same hypothesis,
such a misunderstanding would not be likely to impair accuracy much since
subjects' responses are forced to converge (just as they ought to) by the end
of the response scale. For heterogeneous pairs, however, the misunderstanding
is more serious since there is nothing to prevent subjects from making overly
large adjustments given only weakly diagnostic information, thus causing the
poor correspondence between theoretical and obtained values for these
particular pairs.
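
To make the idea of differential diagnosticity concrete, here is a hedged Python sketch of the normative calculation. It rests on assumptions that are not stated in this section: each sample is taken to consist of 1000 parts, the hypotheses H12/1000 and H20/1000 are read as reject rates of 12 and 20 per 1000, rejects are treated as binomial, and the two hypotheses are given equal prior probability. All names and constants are illustrative.

```python
import math

N_PARTS = 1000           # assumed sample size
P_NORMAL = 12 / 1000     # assumed reject rate under the "normal" hypothesis
P_FAULTY = 20 / 1000     # assumed reject rate under the "faulty" hypothesis

def log_likelihood_ratio(rejects):
    """log[P(sample | faulty) / P(sample | normal)] under a binomial model."""
    return (rejects * math.log(P_FAULTY / P_NORMAL)
            + (N_PARTS - rejects) * math.log((1 - P_FAULTY) / (1 - P_NORMAL)))

def posterior_faulty(first, second):
    """Posterior probability of 'faulty' after two independent samples, equal priors."""
    total = log_likelihood_ratio(first) + log_likelihood_ratio(second)
    return 1 / (1 + math.exp(-total))
```

Under these assumptions the log likelihood ratio grows linearly with the number of rejects, so a sample far from the neutral point moves the posterior much more than one close to it. The same assumptions also make a sample of 20 rejects somewhat stronger evidence than a sample of 12 rejects.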

Experiment 2 investigates whether this hypothesized problem with intuitive
weighting of information can be alleviated by a modification of the training
procedure used in Experiment 1.

Experiment  2 

Method 

Task  and  design.  The  task  for  Experiment  2  was  exactly  like  the  task  for 
Experiment  1  except  that  the  two  samples  within  each  pair  were  presented 
simultaneously.  The  stimulus  design  was  the  same  as  had  been  used  in  Experiment  1. 

Procedure. The procedure for the control subjects was essentially the same
as for Experiment 1 except for the differences occasioned by the simultaneous
stimulus display. However, the instructions for the trained subjects were
more detailed and were applied to every kind of stimulus pair. When trained
subjects were first brought into the experiment they were told that they would
be taught a simple procedure that would allow them to make good judgments in
a particular kind of inference task. Then they were told about the task
situation (i.e., the machine maintenance problem) and were instructed how to
read the stimulus display and use the response device. The actual training
began only after it was clear that the subjects understood the stimulus
situation.

Subjects were taught a four-step procedure to be applied to every stimulus
pair. The steps were introduced to subjects and explained as the subjects
worked through a series of practice trials. During this training period subjects
were asked to work through the steps out loud so that the experimenter could
check on their understanding and use of the procedure. The steps were as
follows:

(a)  Judge  for  each  sample  separately  whether  it  supports  the  "normal" 
hypothesis  or  the  "faulty"  hypothesis  or  whether  it  is  neutral. 

(b)  Decide  which  of  the  two  samples  supports  its  own  hypothesis  more 
strongly. 

(c) Make an initial rating as to whether the machine is faulty or not
based only on the stronger of the two pieces of evidence. If both
pieces are equally strong, either can be used as the basis for the
initial rating.

(d) Adjust the initial rating in order to take into account the second
(weaker) piece of evidence.

(i) If the second piece of evidence favors the same hypothesis as
the first, then "consider the portion of the response scale
between [the] original rating and the [appropriate] end of the
scale and move the arrow into this region according to how
strong the remaining evidence is."

(ii) If the second piece of evidence favors the opposite hypothesis,
then "consider the portion of the rating scale between [the]
original rating and the neutral position and adjust back into
this region according to how strongly the sample [supports the
other hypothesis]."
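
The four steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: each sample is represented by a hypothetical signed strength between -1 and 1 (negative favors "normal", positive favors "faulty", zero is neutral), the response scale is assumed to run from 0 (normal) to 1 (faulty), and the linear mappings in steps (c) and (d) are invented for the example. In the experiment these evaluations were left entirely to the subjects.

```python
def four_step_rating(strength_a, strength_b):
    """Rating on an assumed 0-1 scale (0 = normal, 1 = faulty).

    Arguments are hypothetical signed strengths in [-1, 1].
    """
    # Steps (a) and (b): classify each sample and find the stronger one.
    stronger, weaker = sorted((strength_a, strength_b), key=abs, reverse=True)
    # Step (c): initial rating from the stronger sample alone (assumed linear map).
    initial = 0.5 + 0.5 * stronger
    if weaker == 0:
        return initial  # a neutral second sample calls for no adjustment
    # Step (d): adjust for the weaker sample.
    if (weaker > 0) == (stronger > 0):
        # (i) Same hypothesis: move toward the appropriate end of the scale.
        target = 1.0 if stronger > 0 else 0.0
    else:
        # (ii) Opposite hypothesis: move back toward the neutral position.
        target = 0.5
    return initial + abs(weaker) * (target - initial)
```

Because the stronger sample is always processed first, the result does not depend on the order in which the two samples are supplied, and a conflicting weaker sample can never carry the rating past the neutral point.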

Note that although the procedure sounds complicated when summarized, it was
much simpler to follow in the context of actual stimulus pairs. No subject
appeared to have any great difficulty in following the procedure during training
or in executing the task afterward.

Altogether there were 20 trials for practice and training for the trained
subjects. Control subjects received the same 20 practice trials, but with no
training. Then both groups of subjects received two replications of the basic
stimulus design, bringing the number of trials to 182 per subject.
Experimental trials within replication were ordered randomly but with the
restriction that no given sample appear on two consecutive trials. The full
experiment required about 40 minutes for control subjects and about 55 minutes
for trained subjects.

Subjects. The subjects were 56 student volunteers from the University of
Wisconsin-Madison, split evenly between the control and the trained conditions.
About half were males and half females. Most subjects served for pay, although
a few served for credit to be applied to their course grade in introductory
psychology.

Results  and  Discussion 

The  data  for  the  control  subjects  are  given  in  Figure  5  pooled  over  both 
subjects  and  replications.  Samples  designated  "first"  appeared  above  the  other 
sample  in  the  simultaneous  display. 

Note that the pattern of judgments is essentially identical to that of the
control subjects in Experiment 1. This visual similarity is confirmed by an
analysis of variance showing that the interaction, although significant,
F(64,1728) = 1.81, p < .05, accounts for only .42% of the systematic sum of
squares. Calculation of the root-mean-squared deviations between theoretical
and obtained values reveals an overall RMSD of .1043 for the entire data array,
which breaks down to RMSDs of .0968 for homogeneous cells, .1156 for
heterogeneous cells, and .0322 for diagonal cells.
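
The share of systematic variation carried by the interaction can be illustrated with a short Python sketch that works on a table of cell means. It assumes that "systematic sum of squares" means the sum of the two main-effect sums of squares and the interaction sum of squares; that reading, the function name, and the example tables are assumptions made for illustration.

```python
def interaction_share(table):
    """Interaction sum of squares as a proportion of rows + columns + interaction."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(map(sum, table)) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(row[j] for row in table) / n_rows for j in range(n_cols)]
    ss_rows = n_cols * sum((r - grand) ** 2 for r in row_means)
    ss_cols = n_rows * sum((c - grand) ** 2 for c in col_means)
    # Interaction residual: what is left after removing both main effects.
    ss_int = sum((table[i][j] - row_means[i] - col_means[j] + grand) ** 2
                 for i in range(n_rows) for j in range(n_cols))
    return ss_int / (ss_rows + ss_cols + ss_int)

# A purely additive (parallel) table and a multiplicative one.
additive = [[(a + b) / 2 for b in range(12, 21)] for a in range(12, 21)]
multiplicative = [[a * b for b in range(1, 4)] for a in range(1, 4)]
```

A perfectly parallel table has an interaction share of zero, so a share well below the optimal value signals that responses are closer to averaging than to the normative pattern.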

Figure  5  about  here 


The data for the trained subjects are in Figure 6. Clearly, the training
procedure has been effective in making the subjects' responses more optimal.
In terms of analysis of variance, the interaction [F(64,1728) = 33.89, p < .05]
now accounts for 4.55% of the systematic sum of squares compared to the optimal
value of 4.66%. The overall RMSD between theoretical and obtained values is
.0480, which breaks down to RMSDs of .0321 for homogeneous cells, .0567 for
heterogeneous cells, and .0077 for diagonal cells. Although deviations for
heterogeneous cells are still somewhat larger than those for homogeneous cells,
they are much improved compared to those for the control group. It is also
interesting to note that the largest deviations between theoretical and
obtained values now tend to involve overly radical responses. This is
particularly evident for homogeneous pairs in which one sample was 12 rejects
and the other was near neutral (15 or 16 rejects). In part these errors reflect
the fact that subjects tended to treat the judgment task symmetrically, when
samples of 12 rejects actually gave considerably less support to the hypothesis
H12/1000 than samples of 20 rejects gave to the hypothesis H20/1000.

Figure  6  about  here 


General  Discussion 

Before proceeding to a discussion of the implications of the present
research, it is important to point out exactly what the training procedures
did and did not "teach" the subjects. Obviously there would be little interest
in showing that subjects can learn to use Bayes' theorem if they are given
explicit instruction on how to do so. Debiasing becomes of interest only if
it is possible to modify subjects' predilections by procedures that are closer
to natural modes of thought than is the rote application of an appropriate
normative rule. In other words, the goal is to educate the intuition, not
merely to improve the performance.

In Experiment 1, the training procedure taught the subjects only one
thing that previously they did not know, namely, that adjustments of the initial
rating made after presentation of the second sample should always be in the
direction of the hypothesis favored by the second sample. In Experiment 2, the
explicit training included the same information about adjustment direction
but also taught subjects to process the two samples in order of their apparent
relative strength. At no time in either training procedure did the experimenter
teach the subjects anything about which samples favored which hypothesis or
how strongly they did so, nor did she suggest how diagnostic or "weighty" the
samples should be considered to be. These matters of sample evaluation were
always left entirely to the subjects.

In light of the limited training to which subjects were exposed, the
amount of debiasing that occurred is impressive. In Experiment 1, explicit
training was directed only at the 12 strong-weak pairs and the 8 diagonal pairs.
It would have been entirely within reason for subjects' responses to the other
61 pairs to be unaffected by the training since, so far as was indicated, there
was nothing wrong with their intuitions concerning such pairs. As it turned out,
however, improvement generalized from the strong-weak pairs to analogous
weak-strong pairs. Obviously, there is no way to know for sure why this
improvement occurred, but a possibility that has appeal is that the training
focused subjects' attention on the inferential nature of the task and prevented
the apparently common tendency to fall into judging the sample proportion rather
than the relative likelihood of the two hypotheses. Thus, trained subjects
may have benefitted not only from prior instruction concerning how to prevent
a particular error, but also by being forced, so to speak, to better understand
what it was they were judging.

There was, however, for some trained subjects an interesting failure of
generalization for certain pairs in which a diagnostic sample (i.e., a sample
favoring one or the other hypothesis) was followed by a sample that the subject
judged to be neutral or nondiagnostic (i.e., a sample of 16 rejects). As was
noted earlier, there was considerable variability among subjects in how they
evaluated samples of 16 rejects. Nevertheless, 9 control subjects and 16 trained
subjects seemed reliably to produce initial ratings of about .50 when a sample
of 16 rejects appeared as the first sample. Thus, for these subjects we can
assume that such samples were judged to be neutral. When such samples followed
diagnostic samples, however, all of the control subjects and all but 5 of the
trained subjects adjusted their initial ratings toward neutral, which is
normatively inappropriate given that the sample is judged to support neither
hypothesis. For the control subjects, such errors are not surprising (cf.
Shanteau, 1975; Troutman & Shanteau, 1977). But the question is why so many
trained subjects, if they really understood the task better than control
subjects, also made the inappropriate adjustment. The answer may lie in how
these subjects interpreted the label "neutral." Ideally, a subject who
applies the label "neutral" to the second of two samples will interpret this
as providing zero support for either hypothesis and hence will make no adjustment
of the initial response. But if subjects do not recognize that "neutral" means
"zero support," they may interpret the sample as evidence for another hypothesis,
namely, that the machine is neither clearly normal nor clearly broken, and
adjust toward the scale position (i.e., the midpoint) that best seems to signify
this third hypothesis.
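
The two readings of "neutral" can be contrasted in a brief Python sketch. It assumes that a truly neutral sample has a likelihood ratio of 1 and that the erroneous reading treats the sample as having a scale value at the midpoint, which is then averaged into the rating. The function names, the starting rating of .80, and the averaging weight are hypothetical.

```python
def bayes_update(prior_prob, likelihood_ratio):
    """Posterior probability after one sample, computed through the odds."""
    odds = prior_prob / (1 - prior_prob) * likelihood_ratio
    return odds / (1 + odds)

def averaging_adjustment(rating, sample_value, weight=0.5):
    """Move the rating part of the way toward the sample's scale value."""
    return rating + weight * (sample_value - rating)

initial = 0.80
# "Neutral" read as zero support: likelihood ratio of 1, rating unchanged.
after_bayes = bayes_update(initial, likelihood_ratio=1.0)
# "Neutral" read as support for the midpoint: rating pulled toward .50.
after_average = averaging_adjustment(initial, sample_value=0.5)
```

Only the second reading produces the adjustment toward the midpoint that most subjects made.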

In Experiment 2 the extent of debiasing was even more remarkable,
particularly when one recalls the many previous unsuccessful efforts that have
been directed at reducing conservatism in the Bayesian task. In evaluating this
result, it is important to understand that there was nothing in the instructions
that would have prevented subjects from continuing to weight information
equally regardless of diagnosticity. That is, the subjects were only instructed
to begin their judgment using the "stronger" sample; they were not instructed
to give it more weight in the final judgment. Nevertheless, the net effect of
the manipulation was for judgments to closely approximate optimal values.

Whether this occurred because subjects intended to give the more important
stimulus more weight is, of course, not clear. One might argue that the
improved weighting pattern occurred due to unintentional primacy effects that
were outside of the subjects' comprehension of the task. But it is worth
noting that in natural judgment situations, people often "put first things first,"
considering those items that are deemed to be important before they consider
other, less important items. Thus, one strategy for differential weighting may
be exactly to attend to items in order of importance and to make smaller and
smaller adjustments for items that are of lesser and lesser importance.

What  Do  Judges  Do? 

For more than 20 years, evidence has been accumulating that human judgments
often seem to follow algebraic rules (cf. Anderson, 1974). The "averaging rule"
for inference judgments is merely one case in point. But algebraic models of
judgment have had limited appeal for some judgment researchers because they have
only "as if" status: The data look as if they have been produced by application
of an algebraic rule, but there is no theoretical necessity that the psychological
processes of the judge in any way resemble "paper and pencil" algebraic
manipulation.

The "debiasing" research reported here was based on a procedural theory of
how people generate data that have algebraic patterns. The approach was based
on the assumption that if a person's judgments in inference tasks look more like
averages than like inferences, then at some point during the judgment process
the person must be performing one or more operations that are nearer to those
required for averaging than they are to those required for inferencing.
Debiasing, then, must involve discovering those inappropriate judgment operations
and replacing them by operations that are better suited to inference.

Figure 7 outlines the major steps that are hypothesized to occur during
judgment. In the first step, scanning, the judge merely assesses what
information has been presented for judgment. Obviously, the details of the
scanning step will depend on the task itself. In tasks such as that used in
Experiment 1 where stimulus information is presented sequentially, scanning
will be rudimentary since there is only one stimulus to scan. In simultaneous
tasks, however, scanning will be more clearly distinguishable from other
judgment operations. In some tasks (such as that used in Experiment 2) where
the number of stimulus items is small and where there is no a priori reason to
suppose that any particular item will be more important than any other item,
scanning will include all available items, with order of scanning determined by
stimulus formatting factors. In other cases, however, such as judging
applications for graduate admission, some items may be consistently scanned
before other items (e.g., GPA, GRE scores), and some items may not be scanned
at all, at least not on the initial pass over an application (e.g., applicant's
hobbies, past employment).

It is assumed that the scanning operation is primarily aimed at orienting
the judge to the available information. Although the judge may develop a rough
impression of the stimulus from scanning it, this impression will not in general
be the final response. There may, of course, be exceptions to this rule.
For example, if in scanning a graduate application, the judge notices that the
candidate is clearly below standard on some critical factor, that application
may be immediately rejected. Nevertheless, many experimental tasks implicitly
or explicitly rule out such snap judgments by cautioning the judge against
making overly hasty "end-responses."

Once the judge has scanned the available information, he or she is
hypothesized to select an item to use as an "anchor point" (cf. Einhorn &
Hogarth, 1982; Lopes & Johnson, 1982; Lopes & Oden, 1981; Tversky & Kahneman,
1974). If only one item has been presented, of course, that item must be the
anchor. But if more than one item is available the "anchor stimulus" will
generally be chosen because it seems relatively more important than the others.
Such importance may reflect the a priori importance of the category to which
the item belongs as, for example, GPA for graduate admissions. But it may also
reflect diagnosticity within category as, for example, when items are selected
by virtue of their being very extreme. Only if the various items seem equally
important will the subject resort to ad hoc choice schemes such as, for example,
taking the items in order as they appear in the stimulus array.

Once an anchor has been chosen, it must be evaluated relative to the
scale of judgment. This "valuation" operation may in some cases yield a quantity
that serves directly as the initial judgment. For example, previous research
on the inference task used in the present experiments suggests that subjects may
simply anchor their judgment at a scale position that is proportional to the
number of rejects in the first sample (Lopes, 1981). In other cases, however,
the initial judgment may be somewhat less extreme than the scale value associated
with the anchor stimulus. In these situations subjects act as though their
initial judgment is a compromise between the value of the stimulus information
and some internal, neutral "initial impression" (Anderson, 1967).

Once anchoring has been accomplished the subject must decide whether there
are still important items left to be judged. If so, the process essentially
reiterates, with the subject choosing which of the remaining items to consider
next. As can be seen in the figure, the considerations at this point are
exactly what they were at the time of choosing the anchor: If one of the remaining
items is clearly more important than the others, the subject chooses it;
otherwise an item is chosen arbitrarily, and the scale value of the chosen item
is then determined.

The next step in the process is "adjustment" of the initial value in light
of the new information. It is this step that is seen as being most crucial in
determining the algebraic form of the overall judgment. In the case of "averaging"
rules, the adjustment operation is assumed to involve two stages: (1) location
of the new information on the scale of judgment relative to the initial judgment,
and (2) adjustment of the initial judgment toward the new information. This
produces a new judgment that lies between the first two values and is, in that
sense, an average of the two. Other algebraic rules can also result from the
adjustment stage, but the particulars of their respective adjustment processes
differ in important ways. For example, multiplying can be seen as a form of serial
fractionation (Lopes, 1976; Lopes & Ekberg, 1980) in which adjustments to the
initial value are always downward (toward a zero-point on the response scale)
and directly in proportion to the subjective value of the stimulus being adjusted
for. In the same way, ratio responses, such as those produced by trained subjects
in Experiment 2, can be seen as involving adjustments that reflect the degree to
which new information supports or disconfirms the qualitative impact of previous
information.


After each adjustment step, the judge is assumed to consider whether there
are still important items left unaccounted for. If there are, the process is
repeated with each new item of information leading to adjustment of the previous
judgment, and with the adjustments ordinarily becoming smaller as the perceived
importance of the new information becomes smaller in relation to previously
considered information. At some point, however, either when the stimulus
information runs out or when the subject judges that nothing important remains
to be considered, the final response is made on whatever scale has been provided.
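
The cycle of scanning, selection, valuation, and adjustment described above can be sketched as a short Python loop. The sketch is hypothetical in its details: items are given as (scale value, importance) pairs, the stopping rule is a simple importance threshold, and the size of each adjustment is set to the item's importance divided by the total importance considered so far. That last choice is an assumption made here so that the procedure reproduces a weighted average; the model itself does not specify it.

```python
def judge(items, min_importance=0.0):
    """Anchor on the most important item, then adjust for the rest in turn.

    items: list of (scale_value, importance) pairs (hypothetical inputs).
    """
    # Scan and select: order items from most to least important and
    # ignore those judged too unimportant to consider.
    remaining = sorted(items, key=lambda item: item[1], reverse=True)
    remaining = [item for item in remaining if item[1] > min_importance]
    if not remaining:
        raise ValueError("no item is important enough to judge")
    # Valuation of the anchor gives the initial judgment.
    judgment, considered = remaining[0]
    for value, importance in remaining[1:]:
        considered += importance
        # Adjust toward the new value; the step shrinks as the item's
        # importance falls relative to what has already been considered.
        judgment += (importance / considered) * (value - judgment)
    return judgment
```

With these assumptions `judge([(0.9, 3), (0.5, 1)])` moves from the anchor of .9 one quarter of the way toward .5. The judge never performs explicit arithmetic on weights, yet the final response is the importance-weighted average of the item values.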

Changing  What  Judges  Do 

In order to debias human judgments, it is necessary to change the judgment
process. But how? As Fischhoff (1982) pointed out, that depends on one's
theory of why the judgments are biased. The earliest attempts at debiasing
Bayesian inference were based generally on global notions of why the bias
occurred: i.e., subjects were poorly motivated, or misunderstood the instructions,
or refused to use the response scale properly. These causal models had in common
that they implicitly assumed that the bias could be "fixed" without the necessity
of knowing how the subject actually generated the biased judgment.

The present approach differs from these early methods in that it rests on
an analysis of what the subject does when he or she erroneously produces an
average rather than an inference. In this view, the judgment process is seen as
comprising a set of procedures for scanning, selecting, analyzing and, finally,
integrating stimulus information. The procedures are quantitative, but not
numerical; computational but not arithmetical. The procedures function in such
a way that judgments can be described fairly as the result of an averaging
process, but the "algebra" is implicit in the subject's actions rather than
explicit in conscious awareness.

In debiasing the present task, the first step was to understand exactly
what subjects did in producing their biased judgments. Then it remained only
to identify the faulty procedures and to replace these by similar, but
normatively more appropriate, procedures. What is surprising is that for once
the debiasing was even easier to do than to imagine, due in part to the fact
that the training not only replaced "bad" procedures but, apparently, helped
subjects to better understand the task.

Only one other debiasing method, that of Eils, Seaver, and Edwards (1977),
has been as successful as the present method in improving the performance of
naive subjects in the Bayesian task. On the face of it, there are profound
differences between these two successful approaches, not the least of which is
that Eils et al. engineer the task to fit the subjects whereas the present
approach engineers the subjects (or at least their judgment procedures) to fit
the task. In a deeper sense, however, the two approaches have much in common
because they are both based on an understanding of procedures that subjects
use when they generate biased responses.

Eils et al. base their approach on the empirical observation that subjects,
by whatever means, produce data that are more like averages than like inferences.
They cleverly turn this "error" to their advantage by recasting the task so that
subjects are asked to do what they do naturally and well, namely, averaging.
It then requires only a simple mechanical transformation to convert the subjects'
"average likelihood ratios" into "cumulative likelihood ratios."

The present approach is also based on an understanding of the averaging
process, but the focus is shifted from the algebraic form of the data to the
microstructure of the process that generates the data. The tacit assumption is
that although subjects have access to components of the judgment process (and
hence that they can control the sequence in which components are executed and the
content on which they operate), they do not have access to the algebraic
implications of what their procedures do. Thus, there is little point in enjoining
subjects to be less conservative, or to report their "true" probabilities, or to
multiply rather than average.

Serious  engineering  in  any  domain  rests  on  knowledge  of  the  medium  to  be 
engineered.  In  the  case  of  judgmental  engineering,  one  must  understand  the 
judge  as  well  as  the  judgment.  People  do  some  things  well,  other  things  badly; 
they have access to some processes and not to others. Building better judges
requires  that  we  know  what  people  do  and  how  they  do  it.  After  that,  debiasing 
is easy.

TABLE 1

NUMBER OF ADJUSTMENT ERRORS
EXPERIMENT 1

PAIRS          CONTROL    TRAINED    MAXIMUM POSSIBLE

Weak-strong       0.40       0.42          24
Strong-weak      13.40       3.11          24
Diagonal          3.24       1.08          16

Note. Errors were scored for weak-strong cells and strong-weak cells only if
there was actual adjustment and it was in the wrong direction. Errors were
scored for diagonal cells only if there was room for adjustment and if there
was adjustment in the wrong direction or no adjustment.

13. Lola L. Lopes - Averaging Rules and Adjustment Processes: The Role of Averaging
14. Gregg C. Oden - Integration of Linguistic Information in Language Comprehension.
April 1982.